
Keywords:

conv_i := convolution with stride i 
upconv := upsampling + conv_1
concat(i, j) := feature layer concatenation of levels i and j
IN := input feature layers
OUT := output feature layers
OUT_SIZE := output size (W x H) of the stacked images
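Assuming 'same' padding (the padding scheme is not stated here), a conv_i divides each spatial dimension by its stride i, which is how the OUT_SIZE column follows from the 256x256 input. A minimal sketch of that shape rule (the helper name is illustrative, not from the original):

```python
# Shape rule for conv_i under 'same' padding (an assumption;
# the padding scheme is not given in the table).
def conv_out_size(size, stride):
    # With 'same' padding, the output spatial size is ceil(size / stride).
    return -(-size // stride)  # ceiling division

# A stride-2 convolution halves a 256-wide input to 128.
print(conv_out_size(256, 2))  # 128
# A stride-1 convolution preserves the size.
print(conv_out_size(256, 1))  # 256
```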


INPUT: BATCH_SIZE x 256 x 256 x 3

1) conv_1, 5x5 kernel, IN:   3, OUT:  32, OUT_SIZE: 256x256
2) conv_2, 5x5 kernel, IN:  32, OUT:  32, OUT_SIZE: 128x128
3) conv_2, 3x3 kernel, IN:  32, OUT:  32, OUT_SIZE: 64x64
4) conv_2, 3x3 kernel, IN:  32, OUT:  64, OUT_SIZE: 32x32
5) conv_2, 3x3 kernel, IN:  64, OUT:  64, OUT_SIZE: 16x16
6) conv_2, 3x3 kernel, IN:  64, OUT:  64, OUT_SIZE: 8x8
7) conv_2, 3x3 kernel, IN:  64, OUT: 128, OUT_SIZE: 4x4
8) conv_2, 3x3 kernel, IN: 128, OUT: 128, OUT_SIZE: 2x2

CURRENT SIZE: BATCH_SIZE x 2 x 2 x 128
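The 2x2x128 size can be checked by tracing the eight encoder steps above. A small sketch, assuming 'same' padding so that a stride-2 convolution exactly halves the spatial size (the tuple layout is illustrative):

```python
# Trace the encoder shapes of steps 1-8 above.
# Each tuple: (stride, kernel, out_channels); 'same' padding assumed.
encoder = [(1, 5, 32), (2, 5, 32), (2, 3, 32), (2, 3, 64),
           (2, 3, 64), (2, 3, 64), (2, 3, 128), (2, 3, 128)]

size, channels = 256, 3
shapes = []
for stride, kernel, out_ch in encoder:
    size //= stride          # 'same' padding: the stride divides the size
    channels = out_ch
    shapes.append((size, channels))

print(shapes[-1])  # final encoder feature map: (2, 128)
```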

9) conv_1, 3x3 kernel, IN: 128, OUT: 128*2, OUT_SIZE: 2x2
10) conv_1, 3x3 kernel, IN: 128*2, OUT: 128*4, OUT_SIZE: 2x2

CURRENT SIZE: BATCH_SIZE x 2 x 2 x 512

11) upconv, 3x3 kernel, IN: 128*4, OUT: 64*4, OUT_SIZE: 4x4
    concat(11, 7),      IN:  64*4, OUT: 64*6, OUT_SIZE: 4x4
12) conv_1, 3x3 kernel, IN:  64*6, OUT: 64*4, OUT_SIZE: 4x4

13) upconv, 3x3 kernel, IN:  64*4, OUT: 64*4, OUT_SIZE: 8x8
    concat(13, 6),      IN:  64*4, OUT: 64*5, OUT_SIZE: 8x8
14) conv_1, 3x3 kernel, IN:  64*5, OUT: 64*4, OUT_SIZE: 8x8

15) upconv, 3x3 kernel, IN:  64*4, OUT: 64*4, OUT_SIZE: 16x16
    concat(15, 5),      IN:  64*4, OUT: 64*5, OUT_SIZE: 16x16
16) conv_1, 3x3 kernel, IN:  64*5, OUT: 64*4, OUT_SIZE: 16x16

17) upconv, 3x3 kernel, IN:  64*4, OUT: 32*4, OUT_SIZE: 32x32
    concat(17, 4),      IN:  32*4, OUT: 32*6, OUT_SIZE: 32x32
18) conv_1, 3x3 kernel, IN:  32*6, OUT: 32*4, OUT_SIZE: 32x32

19) upconv, 3x3 kernel, IN:  32*4, OUT: 32*4, OUT_SIZE: 64x64
    concat(19, 3),      IN:  32*4, OUT: 32*5, OUT_SIZE: 64x64
20) conv_1, 3x3 kernel, IN:  32*5, OUT: 32*4, OUT_SIZE: 64x64

21) upconv, 3x3 kernel, IN:  32*4, OUT: 32*4, OUT_SIZE: 128x128
    concat(21, 2),      IN:  32*4, OUT: 32*5, OUT_SIZE: 128x128
22) conv_1, 3x3 kernel, IN:  32*5, OUT: 32*4, OUT_SIZE: 128x128

23) upconv, 5x5 kernel, IN:  32*4, OUT: 32*4, OUT_SIZE: 256x256
    concat(23, 1),      IN:  32*4, OUT: 32*5, OUT_SIZE: 256x256
24) conv_1, 5x5 kernel, IN:  32*5, OUT:  3*4, OUT_SIZE: 256x256

FINAL SIZE: BATCH_SIZE x 256 x 256 x 12
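The channel counts through the decoder follow from the concatenations: each upconv output is stacked with the matching encoder layer, and the next conv_1 reduces back to the base*4 width. A sketch of that bookkeeping (channel-wise concatenation is assumed):

```python
# Channel arithmetic for the decoder concatenations (steps 11-24).
# Each tuple: (upconv output channels, skip channels from the encoder).
steps = [
    (64 * 4, 128),  # concat(11, 7): 256 + 128 = 64*6
    (64 * 4, 64),   # concat(13, 6): 256 + 64  = 64*5
    (64 * 4, 64),   # concat(15, 5): 256 + 64  = 64*5
    (32 * 4, 64),   # concat(17, 4): 128 + 64  = 32*6
    (32 * 4, 32),   # concat(19, 3): 128 + 32  = 32*5
    (32 * 4, 32),   # concat(21, 2): 128 + 32  = 32*5
    (32 * 4, 32),   # concat(23, 1): 128 + 32  = 32*5
]
concat_channels = [up + skip for up, skip in steps]
print(concat_channels)  # [384, 320, 320, 192, 160, 160, 160]
```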

Every convolution is followed by a bias add and a non-linearity: a ReLU for every layer except the last, which uses a sigmoid.
The architecture shown is the one used to derive the light transport layers. The architecture used to infer normals is analogous, except that steps 9 and 10 are skipped and the IN and OUT channel multipliers in the deconvolution part of the network are 1, 2, and 3 instead of 4, 5, and 6, respectively.
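Under that substitution the decoder widths can be written generically: with base multiplier m, an upconv outputs base*m channels, and concatenating a skip connection of base*k channels yields base*(m+k). A small illustration (the function name is hypothetical):

```python
# Decoder channel width after a concatenation, for a base width
# (64 or 32), the variant multiplier m, and a skip of base*k channels.
def concat_width(base, m, k):
    return base * m + base * k  # channels after channel-wise concat

# Light transport (m = 4), concat(11, 7): 64*4 + 64*2 = 64*6 channels.
print(concat_width(64, 4, 2))  # 384
# Normals variant (m = 1), same level: 64*1 + 64*2 = 64*3 channels.
print(concat_width(64, 1, 2))  # 192
```

This matches the stated rule: where the light transport network uses the multipliers 4, 5, 6, the normals network uses 1, 2, 3.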


